Thinking about FA

PCA

Seek to find orthogonal factors to minimize reconstruction error.

\[\mathscr{Loss} = X - ZW^T\]

X is the \(NxD\) matrix of observed scores, \(Z\) are the \(N x L\) component scores where L < D, while \(W\) are the \(DxL\) weights, typically referred to as the factor loading matrix.

Each component, or factor, created accounts for less of the variance in the original data. With all components,

\[X = ZW^T\]

The MNIST data contains 28 by 28 pixel images of digits 0-9. If we unroll the data to 784 columns, each row represents a single digit. We can see in the following how well we can reconstruct a digit via PCA. Even with only two components we can get a sense of what we’re looking at. With all components the reconstruction is perfect.

Example

Let’s see an example with more digestible results. The following data are 16 multiple choice ability items taken from the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. There are 1525 total subjects, and four categories of questions.

  • reason: verbal reasoning questions
  • letter: In the following alphanumeric series, what letter comes next?
  • matrix: A matrix reasoning task
  • rotate: spatial rotation task
library(psych)
cog_ability = psych::iqitems


Let’s do a PCA. We’ll look at four components, though remember that PCA technically returns as many components as there are variables. You’re just seeing part of the results. Note that PCA almost universally requires you to standardize the data, so that each variable has the same variance. Otherwise you’ll just get components reflecting the relative variances. In what follows the PCA is done on the correlation matrix, which amounts to the same thing.

pc = principal(cog_ability, 4)
pc
Principal Components Analysis
Call: principal(r = cog_ability, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
            RC1   RC2   RC3   RC4   h2   u2 com
reason.4   0.60  0.01  0.14  0.09 0.38 0.62 1.2
reason.16  0.67  0.07  0.11  0.04 0.47 0.53 1.1
reason.17  0.50  0.26 -0.08  0.11 0.33 0.67 1.7
reason.19  0.64  0.02  0.14  0.04 0.43 0.57 1.1
letter.7   0.71  0.05  0.02 -0.01 0.51 0.49 1.0
letter.33  0.35 -0.06  0.06  0.58 0.46 0.54 1.7
letter.34  0.64  0.04  0.06 -0.03 0.42 0.58 1.0
letter.58  0.43 -0.17  0.07  0.42 0.40 0.60 2.4
matrix.45  0.64  0.11  0.08  0.10 0.43 0.57 1.1
matrix.46 -0.11  0.26  0.08  0.61 0.46 0.54 1.5
matrix.47 -0.02  0.17  0.06  0.66 0.46 0.54 1.2
matrix.55  0.32  0.32  0.11  0.11 0.23 0.77 2.5
rotate.3   0.09  0.72  0.27  0.12 0.61 0.39 1.4
rotate.4   0.09  0.81 -0.12  0.10 0.70 0.30 1.1
rotate.6   0.17 -0.06  0.84  0.06 0.74 0.26 1.1
rotate.8   0.14  0.22  0.77  0.14 0.68 0.32 1.3

                       RC1  RC2  RC3  RC4
SS loadings           3.28 1.55 1.49 1.40
Proportion Var        0.20 0.10 0.09 0.09
Cumulative Var        0.20 0.30 0.39 0.48
Proportion Explained  0.42 0.20 0.19 0.18
Cumulative Proportion 0.42 0.63 0.82 1.00

Mean item complexity =  1.4
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.07 
 with the empirical chi square  2033.18  with prob <  0 

Fit based upon off diagonal values = 0.87

First focus on the portion of the output where it says SS loadings . The first line is the sum of the squared loadings1 for each component (in this case where we are using a correlation matrix as the basis, summing across all 16 possible components would equal the value of 16). The Proportion Var tells us how much of the overall variance the component accounts for out of all the variables (e.g. 3.28 / 16 = 0.2). The Cumulative Var tells us that 4 components make up over 48% the variance. The others are the same thing just based on the 4 retained components rather than all 16 variables (i.e. the cumulative explained variance would = 1). We can see that the second component accounts for less variance, and this would continue with additional components, where each accounts for a decreasing amount of variance.

Some explanation of the other parts of the output:

  • h2: the amount of variance in the item/variable explained by the (retained) components. It is the sum of the squared loadings, a.k.a. communality. For example, the first reasoning item shares only 38% of its variance with these components.
  • u2: 1 - h2
  • com: A measure of complexity. A value of 1 might be seen for something that loaded on only one component, and zero otherwise (a.k.a. perfect simple structure).

Loadings, also referred to as the pattern matrix, in this scenario represent the estimated correlation of an item with its component, and provide the key way in which we interpret the factors. As an example, we can reproduce the loadings by correlating the observed variables with the estimated component scores2.

cor(cog_ability, pc$scores, use = 'pair') %>% round(2)
            RC1   RC2   RC3   RC4
reason.4   0.60  0.01  0.14  0.09
reason.16  0.67  0.07  0.11  0.04
reason.17  0.50  0.26 -0.08  0.11
reason.19  0.64  0.02  0.14  0.04
letter.7   0.71  0.05  0.02 -0.01
letter.33  0.35 -0.06  0.06  0.58
letter.34  0.64  0.04  0.06 -0.03
letter.58  0.43 -0.17  0.07  0.42
matrix.45  0.64  0.11  0.08  0.10
matrix.46 -0.11  0.26  0.08  0.61
matrix.47 -0.02  0.17  0.06  0.66
matrix.55  0.32  0.32  0.11  0.11
rotate.3   0.09  0.72  0.27  0.12
rotate.4   0.09  0.81 -0.12  0.10
rotate.6   0.17 -0.06  0.84  0.06
rotate.8   0.14  0.21  0.77  0.14

It can be difficult to sort it out just by looking at the values, so we’ll look at it visually. In the following plot, stronger loadings are indicated by blue, and we can see the different variables associated with different components.


Interpretation is the fun, but commonly difficult, part. In this case, the variables don’t appear to be grouping like we’d expect, except for some of the reasoning scores3. It’s worth mentioning the naming fallacy at this point. Just because we associate a factor with some concept, doesn’t make it so. In addition, for PCA the goal is not interpretation, but to reduce the data while retaining as much of the variance as possible.

PCA Summary

Why do it?

  • Reduce the dimensions of the data
  • Retain as much variance as possible

Issues

  • It might not be the best choice with some data types
  • It is not the choice to make if you are wanting to understand the covariance in the data (though it will often agree with methods that do)
  • Non-numeric or mixed data types

Other stuff

  • Scale the variables
  • Biplots are ugly

FA

\[X \approx ZW^T\]

fa_model = fa(cog_ability, 4, rotate='oblimin')
fa_model
Factor Analysis using method =  minres
Call: fa(r = cog_ability, nfactors = 4, rotate = "oblimin")
Standardized loadings (pattern matrix) based upon correlation matrix
            MR1   MR3   MR2   MR4   h2   u2 com
reason.4   0.53  0.04 -0.04  0.04 0.30 0.70 1.0
reason.16  0.61  0.03  0.03 -0.04 0.40 0.60 1.0
reason.17  0.43 -0.05  0.11  0.08 0.22 0.78 1.2
reason.19  0.57  0.04 -0.03  0.00 0.34 0.66 1.0
letter.7   0.67 -0.03  0.03 -0.09 0.43 0.57 1.0
letter.33  0.33  0.02 -0.04  0.27 0.21 0.79 2.0
letter.34  0.57  0.00  0.00 -0.05 0.32 0.68 1.0
letter.58  0.38  0.03 -0.07  0.14 0.19 0.81 1.4
matrix.45  0.58  0.00  0.00  0.07 0.36 0.64 1.0
matrix.46 -0.05  0.01  0.04  0.49 0.25 0.75 1.0
matrix.47  0.03  0.03  0.05  0.37 0.17 0.83 1.1
matrix.55  0.27  0.04  0.09  0.15 0.15 0.85 1.8
rotate.3   0.00  0.25  0.46  0.08 0.36 0.64 1.6
rotate.4   0.02 -0.04  0.80  0.00 0.64 0.36 1.0
rotate.6   0.05  0.65 -0.11 -0.03 0.43 0.57 1.1
rotate.8  -0.01  0.68  0.08  0.04 0.49 0.51 1.0

                       MR1  MR3  MR2  MR4
SS loadings           2.68 1.01 0.95 0.60
Proportion Var        0.17 0.06 0.06 0.04
Cumulative Var        0.17 0.23 0.29 0.33
Proportion Explained  0.51 0.19 0.18 0.11
Cumulative Proportion 0.51 0.70 0.89 1.00

 With factor correlations of 
     MR1  MR3  MR2  MR4
MR1 1.00 0.40 0.19 0.16
MR3 0.40 1.00 0.12 0.37
MR2 0.19 0.12 1.00 0.38
MR4 0.16 0.37 0.38 1.00

Mean item complexity =  1.2
Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the null model are  120  and the objective function was  2.6 with Chi Square of  3944.59
The degrees of freedom for the model are 62  and the objective function was  0.06 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.02 

The harmonic number of observations is  1523 with the empirical chi square  100.71  with prob <  0.0014 
The total number of observations was  1525  with Likelihood Chi Square =  91.43  with prob <  0.0089 

Tucker Lewis Index of factoring reliability =  0.985
RMSEA index =  0.018  and the 90 % confidence intervals are  0.009 0.025
BIC =  -363.01
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   MR1  MR3  MR2   MR4
Correlation of (regression) scores with factors   0.90 0.82 0.84  0.69
Multiple R square of scores with factors          0.80 0.67 0.70  0.48
Minimum correlation of possible factor scores     0.61 0.33 0.40 -0.04

FA for measurement

“Confirmatory” Factor Analysis

modelCode = "
  verbal =~ reason.4 + reason.16 + reason.17 + reason.19
  spatial =~ rotate.3 + rotate.4 + rotate.6 + rotate.8

  # not necessary
  # verbal ~~ spatial

  # if you don't want them correlated
  # verbal ~~ 0*spatial
"

famod = cfa(modelCode, data=cog_ability)
summary(famod, standardized=T, rsq=T, nd=2)
pars = parameterEstimates(famod, standardized=T)

SEM

IRT

As latent linear models

FA

\[ X \sim \mathcal{N}(ZW^T + \mu, \Psi) \] \(\mu\) are the intercepts, \(\Psi\) is a \(DxD\) covariance matrix.

Probabilistic PCA

For probabilistic PCA \(\Psi\) is \(\sigma^2I\).

PCA

For standard PCA, \(\sigma \rightarrow 0\),

Count-based Matrix Factorization

Commonly our data regards counts or otherwise compositional data. In this case we can use techniques that are better suited to such data.

NMF

Non-negative matrix factorization is similar to PCA, just with the constraint that scores and weights be positive. It is (in the usual implementation) identical to probabilistic latent semantic analysis.

LDA

Latent Dirichlet Allocation takes a different approach to deal with count data. The classic example regards text analysis, where LDA is applied to word frequencies across topics. Consider a document-term matrix where each row regards a document, and each column a word or term. LDA looks for latent topics that are probability distributions over the terms. There is a probability distribution for the topics as well as the terms, and given both, we will see some occurrence of the term in each document. Each document can be seen a mix of topics.

In the factor analysis sense, each variable (the term) will have some probability of occurring for a given topic, i.e. factor. One can think of the term/variable probabilities for a specific topic similar to how we did loadings in the factor analysis sense. Every variable will have some non-zero probability of occurrence for a given factor, but often they are essentially zero. Factor scores are the document topic probabilities.

Given the probabilities of topics and terms within topics, there is a multinomial draw of counts for an observation. With this probabilistic result, there isn’t a ‘reconstruction’ connotation with LDA. However we can simulate a mulinomial draw based on the total counts we see. For example, let’s assume 5 terms seen a total of 20 times given some probability.

t(rmultinom(1, size = 20, prob=5:1)) # first term is most likely, 5th term is least likely
     [,1] [,2] [,3] [,4] [,5]
[1,]    4    7    5    2    2

Those are the counts we see for a given observation.

LDA wouldn’t be the best tool if your goal is prediction, and you don’t care much about interpretation.

LDA is basically NMF with Dirichlet priors on the topics.

Latent classes

Other

Issues


  1. They are the eigenvalues of the correlation matrix. In addition, they are the diagonal of the crossproduct of the loading matrix.

  2. These results have been rotated, something practically no one using PCA actually does.

  3. Results like this are more the rule than the exception in my experience. Not enough initial development is typically done with scales, and when other people use them in other settings, it often ends with disappointment. Just think, many psychological scales are developed on freshman psychology students. How well do you think that will generalize?